Day 1 Session 2:
Dynamic spectral analysis
using Generalised Additive Mixed-Effect Models (GAMMs)

Introduction

In the previous sesseion, we tried running a statistical analysis using linear mixed-effect modelling. It was a static analysis based on a single point measurement of formant frequencies to characterise differences of the English /l/ and /ɹ/ quality produced by L1 Japanese and L1 English speakers.

While we observed some key between-group differences in the static data, the static analysis did not allow us to answer why such differences would occur. Specifically, it was not very unclear whether it is just the height of the formant frequencies that are different, or the formant frequency difference is just a small part of a broader, more fundamental difference.

More crucially, analysing the liquid consonants based on the static analysis lacks the consideration that English liquid consonants are inherently dynamic. This means that their spectral properties vary as a function of time, and a single-point measurement of formant frequencies is not adequate in characterising the English liquid quality. Also, English liquids show a strong interaction with the neighbouring vowels, so we need to consider how English liquids are coarticulated with the following vowel.

Taken together, we need to take the temporal dimension into account when analysing English liquids. The possible liquid-vowel interaction may mean that L1 Japanese speakers may produce the sequence of liquid consonant and a vowel with a different interaction pattern from that of L1 English speakers.

In this session, therefore, let’s try modelling the whole formant contours directly using Generalised Additive Mixed-Efefct Models (GAMMs).

Generalised Additive Mixed-Effect Models (GAMMs)

Preliminaries

Installing/loading packages

As always, let’s first install and load R packages that we are using in the static analysis section. The installation commands have been commented out, but please uncomment them and install the packages if you do not have them on your machine yet.

# installing packages
# install.packages("tidyverse")
# install.packages("mgcv")
# install.packages("itsadug")
# install.packages("tidymv")
# install.packages("tidygam")

# importing packages
library(tidyverse)
library(mgcv)
library(itsadug)
# library(tidymv) # for 'get_gam_predictions'
library(tidygam) # for 'get_gam_predictions'
source("https://raw.githubusercontent.com/soskuthy/gamm_intro/master/gamm_hacks.r")

Importing data set

Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.

# import the csv file "initial.liquid.dynamic.csv" from the "data" directory and save it as "df_dyn"
df_dyn <- readr::read_csv("data/initial.liquid.dynamic.csv")

Check data

Similarly to the static analysis, let’s spend some time inspecting the data set using colnames().

# Let's check what columns the data frame contains
colnames(df_dyn)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "f0"             "duration"       "segment"        "previous_sound"
##  [9] "next_sound"     "word"           "f1"             "f2"            
## [13] "f3"             "time"           "IsApproximant"  "IsAcoustic"    
## [17] "note"           "gender"         "omit"           "Barkf1"        
## [21] "Barkf2"         "Barkf3"         "f2f1"           "f3f2"          
## [25] "Barkf2f1"       "Barkf3f2"       "position"       "context"       
## [29] "liquid"         "input_file"

The majority of the variables in the data set are quite similar to the static data set. The dynamic data set also contains bark-normalised formant frequencies, but we won’t be using them here.

The crucial difference between the static and the dynamic data sets is the time column. As I explained earlier, the crucial aspect in the dynamic analysis is the time dimension, and we are interested in the time-varying characteristics in formant trajectories.

Let’s look into the time column in more detail. The following code displays what information is contained in the time column.

df_dyn |> 
  dplyr::group_by(time) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 11 × 1
##     time
##    <dbl>
##  1     0
##  2    10
##  3    20
##  4    30
##  5    40
##  6    50
##  7    60
##  8    70
##  9    80
## 10    90
## 11   100

According to the code above, the time dimension contains numbers from 0 to 100 with an increment of 10. The data set expresses the time dimension proportionally from 0% to 100%. This means that the beginning of an interval corresponds to 0% time point, and then we get time points 10%, 20%, 30% etc as time goes by, until 100% that corresponds to the end of an interval.

The next question here: what interval are we talking about? In the introduction section, I talked about the interaction between the liquid consonant and the following vowel. For this reason, we will treat the liquid and vowel intervals as one whole entity, in which each interval contains movement of the formant trajectories throughout the word-initial liquid and vowel intervals. This means that 0% time point corresponds to the onset of the word-initial liquid, and 100% time point corresponds to the end point of the following vowel.

Data wrangling

Omitting irrelavent columns

Let’s tidy up the data a little bit to avoid further confusion. As done with the static anlaysis, we will try to remove columns that are no longer necessary. The three columns, IsApproximant, IsAcoustic, and omit can safely be removed, as well as some spectral measures including Bark-normalised and distance measures as they can be confusing later on.

# Let's check the number of "approximant" tokens
df_dyn |> 
  dplyr::group_by(IsApproximant) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsApproximant
##   <chr>        
## 1 yes
# Let's check the number of tokens of good recording quality
df_dyn |> 
  dplyr::group_by(IsAcoustic) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsAcoustic
##   <chr>     
## 1 yes
# How about 'omit'?
df_dyn |> 
  dplyr::group_by(omit) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##    omit
##   <dbl>
## 1     0
# Remove columns that we no longer need
df_dyn <- df_dyn |> 
  dplyr::select(-c(IsApproximant, IsAcoustic, omit, Barkf1, Barkf2, Barkf3, Barkf2f1, Barkf3f2, f2f1, f3f2))

Let’s check the column names again.

colnames(df_dyn)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "f0"             "duration"       "segment"        "previous_sound"
##  [9] "next_sound"     "word"           "f1"             "f2"            
## [13] "f3"             "time"           "note"           "gender"        
## [17] "position"       "context"        "liquid"         "input_file"

Note that we have a column called context: this can be useful for grouping so let’s keep them. But we can convert them into IPA symbols for a more intuitive representation:

# convert the ARPABET notation into IPA symbols
df_dyn <- df_dyn |> 
  dplyr::mutate(
    context = case_when(
      context == "AE" ~ "/æ/",
      context == "IY" ~ "/i/",
      context == "UW" ~ "/u/"
    )
  )

Checking the number of participants, tokens…

Let’s also obtain some descriptive statistics here. Note that we need to divide the number of rows by 11 to obtain the accurate number of tokens, as one token now has 11 time points.

# number of participants
df_dyn |> 
  dplyr::group_by(language) |> 
  dplyr::summarise(n = n_distinct(speaker)) |> 
  dplyr::ungroup()
## # A tibble: 2 × 2
##   language     n
##   <chr>    <int>
## 1 English     14
## 2 Japanese    31
# number of tokens per segment
df_dyn |> 
  dplyr::group_by(segment) |> 
  dplyr::summarise(n = n()/11) |> # divide by 11 time points
  dplyr::ungroup()
## # A tibble: 6 × 2
##   segment     n
##   <chr>   <dbl>
## 1 LAE1      511
## 2 LIY1      408
## 3 LUW1      299
## 4 RAE1      502
## 5 RIY1      481
## 6 RUW1      314

Your turn

Similarly in the static analysis, please explore the data a little to understand the data better.

You can start with checking the column names to see what variables are available in the data set. Then, use dplyr::group_by(), dplyr::summarise() and dplyr::ungroup() functions to inspect the data.

# Check data further
# ...

Data visualisation

Scaling formant frequencies

Now, let’s visualise the dynamic data. The basic procedure is the same as in the static analysis; We first apply z-score normalisation to the formant frequencies to make sure that formant values are comparable across speakers.

df_dyn <- df_dyn |> 
  dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
  dplyr::mutate(
    f1z = as.numeric(scale(f1)), # scale f1 into z-score
    f2z = as.numeric(scale(f2)), # scale f2 into z-score
    f3z = as.numeric(scale(f3)) # scale f3 into z-score
  ) |> 
  dplyr::ungroup() # don't forget ungrouping

Descriptive statistics

Let’s check the mean and SD for both raw and normalised formant values: just see F1 for now. Note that the mean z-scores do not seem to look zero, but this is because computers are not very good at dealing with very small numbers (e.g., decimals) and some fluctuations occur in computing the values.

# check mean and sd of raw/scaled F1 values for each speaker
df_dyn |> 
  dplyr::group_by(speaker) |>
  dplyr::summarise(
    f1_mean = mean(f1),
    f1_sd = sd(f1),
    f1z_mean = mean(f1z),
    f1z_sd = sd(f1z)
  ) |> 
  dplyr::ungroup() 
## # A tibble: 45 × 5
##    speaker f1_mean f1_sd  f1z_mean f1z_sd
##    <chr>     <dbl> <dbl>     <dbl>  <dbl>
##  1 2d57ke     468.  140. -9.05e-17   1   
##  2 2drb3c     575.  243.  2.39e-16   1   
##  3 2zy9tf     459.  228. -2.41e-16   1   
##  4 3bcpyh     438.  142. -5.72e-18   1.00
##  5 3hsubn     537.  177.  6.11e-17   1   
##  6 3pzrts     458.  195. -1.44e-18   1.00
##  7 4ps8zx     467.  172.  2.19e-16   1   
##  8 54i2ks     453.  192.  1.43e-16   1.00
##  9 5jzj2h     412.  133.  2.49e-16   1.00
## 10 5upwe3     444.  189. -6.51e-17   1.00
## # ℹ 35 more rows

Your turn

Write a code chunk to inspect the mean and sd of raw/scaled F2 and F3 values for each speaker and for speaker group. Think how you group the data in dplyr::group_by().

# check mean and sd of raw/scaled F2 values for each speaker group
# ...

# check mean and sd of raw/scaled F3 values for each speaker group
# ...

Visualisation

raw trajectories

Let’s visualise the formant trajectories here. In the dynamic analysis, it is almost always the case that we take the time dimension on the x-axis and the dependent variable on the y-axis. This allows us to see how e.g., F1 changes over time.

We also make sure about the variable grouping to tell ggplot2 how to organise the data. This can be done via group argument in the geom function. Each formant trajectory should come from one audio file, which is stored in the file column. We use this information for grouping.

# F1 - raw trajectories
df_dyn |> 
  ggplot(aes(x = time, y = f1z)) +
  geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
  geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F1 (z-normalised)", title = "time-varying change in F1 frequency") +
  theme(strip.text.y = element_text(angle = 0))

smooths

While I usually prefer just plotting raw trajectories because it is faithful to the nature of the data, I must admit that it is sometimes very difficult to see what’s going on there.

If you prefer, we could also just plot smooths to highlight the nonlinear between-group difference. The code below adds smoothed F1z trajectories to the raw data we just plotted (but I have commented out the raw trajectories for now). Note the difference in grouping; we used the file variable for the raw trajectories, but for smooths we need to use the language variable because we would like one smoothed trajectory for each L1 group.

# F1 - smooths
df_dyn |> 
  ggplot(aes(x = time, y = f1z)) +
  # geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
  # geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
  geom_smooth(aes(colour = language, group = language), linewidth = 1.2, se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F1 (z-normalised)", title = "smoothed time-varying change in F1 frequency") +
  theme(strip.text.y = element_text(angle = 0))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Your turn

Please try writing a code chunk to visualise (1) raw and (2) smoothed trajectories. You could try combining them both, too. Please feel free to explore your favourite colour schemes!

# F2 - raw trajectory
# ...

# F2 - smooths
# ...
# F3 - raw trajectory
# ...

# F3 - smooths
# ...

How do we model F2 trajectory?

I hope that you have enjoyed data visualisation! You’d know if you have tried plotting F2, but it seems that the F2 trajectories show very interesting between-group difference.

This is also a very interesting dimension to explore from the theoretical point of view, because previous research has claimed that L1 Japanese speakers would make it easier to acquire the use of F2 in a target-like manner. But the dynamic data suggests that L1 Japanese speakers do something really different from L1 English speakers. Let’s take a look at trajectories first below:

# F2
df_dyn |> 
  ggplot(aes(x = time, y = f2z)) +
  geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
  geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
  geom_smooth(aes(group = language), colour = "white", linewidth = 2.8, se = FALSE) +
  geom_smooth(aes(colour = language, group = language), linewidth = 1.8, se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F2 (z-normalised)", title = "raw/smoothed time-varying change in F2 frequency") +
  theme(strip.text.y = element_text(angle = 0))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

L1 English speakers follow somewhat consistent patterns across vowel contexts for both English /l/ and /ɹ/. They all start at a lower F2 at the beginning of the liquid onset (= 0%). The trajectories then go higher up to the maximal point in the middle of the interval (= around 50%). After reaching the highest peak, the trajectories go down towards the end of the interval (= 100%). Although there are some differences in terms of the timing that they achieve the maximal point, the overall patterns are fairly consistent.

L1 Japanese speakers, on the other hand, show distinct trajectory patterns across vowel contexts. In the /æ/ context, both English /l/ and /ɹ/ trajectories show an almost monotonic, linear decrease from the liquid onset (= 0%) towards the vowel offset (= 100%). The trajectories in the /u/ context are also similar. In these two vowel contexts, the English /ɹ/ trajectories seem to mark the highest point at around 25% time point, but it is not as pronounced as that of L1 English speakers.

Their /i/ trajectories, on the other hand, show a similar pattern to that of L1 English speakers. They start at a lower F2 value at the liquid onset (= 0%), go higher up, and then show a slight decrease towards the end of the interval. The timing of the maximal point, however, is quite early (i.e., at around 40-45% time point) compared to L1 English speakers.

Your turn

Please compare and discuss the static and dynamic plots for F2.

Note: The static analysis is based on the F2 measurement at the liquid midpoint. The dynamic analysis shows time-varying changes in F2 over liquid-vowel interval.

From linear to non-linear modelling: Introducting Generalised Additive Mixed Effects Models (GAMMs)

Modelling non-linearity in the data

Here, we would like to model the formant trajectory directory without relying just on a straight line. This can be illustrated by the plot below that shows the time-varying transition of F2 signals (in Hz) for the word reap produced by an L1 Japanese speaker.

As you can see in the plot below, the linear regression line (in yellow) does not capture the overall trend of the formant trajectory anymore. What we would ideally like is the non-linear curve in skyblue.

One solution to the non-linear modelling is via Generalised Additive (Mixed Effects) Models: GA(M)Ms. While GA(M)Ms are similar with the linear models in that they both can capture linear trends with parametric terms, GA(M)Ms also incorporate smooth terms that can capture non-linear relationships between the variables. Put it simply, GA(M)Ms is a combination (i.e., addition) of multiple (smooth) functions, which can be notated as:

\(y = \alpha + f(x) + \epsilon\)

where \(f(x)\) just means some function of \(x\) and \(\epsilon\) indicates an error term.

Basis functions

GA(M)Ms models the non-linearity in the data by combining multiple basis functions. The number of basis functions is related to the smoothness/wiggliness of a given GA(M)Ms model, such that, for example, there is more room for a GA(M)Ms trajectory to be smoother if the number of basis functions is high. Smoothness/wiggliness of a GA(M)Ms model is determined by the smoothing parameters, which GAM automatically estimate based on the data, and the number of basis function sets an upper limit as to how smooth a model can be.

What you can adjust in your model is the number of knots. A knot is a converging point between each of the basis functions, and the number of knots corresponds to the number of basis functions + 1. In the plot below, I show ten basis functions based on cubic regression splines, which means that there are eleven knots that I specify in the GAM model.

There are a few types of basis functions that can be specified in the GAMMs model. Common ones include thin plate regression splines (tp) and cubic regression splines (cr). They look quite different from each other, but resulting final smooth curve ends up very similar.

Parametric and smooth terms

Let’s now go back again to our data set and see how a GAMMs model can be specified on R. For example, the model below is a GAM model (without a random effect) that captures time-varying non-linearity along the F2 dimension:

mgcv::bam(f2z ~ language + context + s(time) + s(time, by = language) + s(time, by = context), data = df_dyn_L)

The model specification notation is somewhat similar to the one for the lme4::lmer() convention for the linear models, so hopefully this isn’t too complex for you.

As explained earlier, a GAM model can take two kinds of predictor variables: parametric and smooth terms. Parametric terms specify predictors that would influence a constant, overall influence on the dependent variable. This usually corresponds to the height of the trajectory. In the model above, this corresponds to language and context.

The new addition from the linear model would be smooth terms which allow GA(M)Ms to fit non-linear effects as a function of a variable. Smooth terms are notated as s(). This usually corresponds to the shape of the trajectory. And in the model above, you can see a series of s() terms, including s(time), s(time, by = language) and s(time, by = context). We will cover the specifics later, but basically these instruct GAM to model the non-linearity seen in F2 (dependint variable) as a function of time.

Let’s model F2 trajectory

Let’s try modelling a simple GAMMs model based on our data set. We start with a simple GAM model to model the effects of language and context on the F2 trajectory.

As usual, we’ll first subset the data and then convert the variables into factor, a similar procedure to the static analysis.

# subsetting data -- for /l/
df_dyn_L <- df_dyn |>
  dplyr::filter(liquid == "L")

# language variable
df_dyn_L$language <- as.factor(df_dyn_L$language)
levels(df_dyn_L$language)
## [1] "English"  "Japanese"
# vowel (context) variable
df_dyn_L$context <- as.factor(df_dyn_L$context)
levels(df_dyn_L$context)
## [1] "/æ/" "/i/" "/u/"
# speaker variable -- random effect
df_dyn_L$speaker <- as.factor(df_dyn_L$speaker)
levels(df_dyn_L$speaker)
##  [1] "2d57ke" "2drb3c" "2zy9tf" "3bcpyh" "3hsubn" "3pzrts" "4ps8zx" "54i2ks"
##  [9] "5jzj2h" "5upwe3" "6p63jy" "7cd4t4" "9c4efu" "bfwizh" "birw55" "bj8mjm"
## [17] "byxcff" "c5y8z6" "cdsju7" "cu2jce" "dbtzn2" "dcxuft" "ds6umh" "f9japd"
## [25] "fgd95u" "fkcwjr" "h5s4x3" "heat7g" "hgrist" "i3wa7f" "jcy8xi" "kjn9n4"
## [33] "m46dhf" "m5r28t" "s6a8gh" "srky8g" "tay55n" "uig6n9" "ut4e5m" "we8z58"
## [41] "wrgwc3" "xub9bc" "z3n578" "zajk25" "zz3r2g"
# word variable -- random effect
df_dyn_L$word <- as.factor(df_dyn_L$word)
levels(df_dyn_L$word)
## [1] "lamb"  "lamp"  "lap"   "leaf"  "leap"  "leave" "loom"  "lube"

A simple (linear) model only with parametric terms

What if we just model F2 only with parametric terms? Let’s just start with language variable.

# fit a model with parametric terms only
m1 <- bam(f2z ~ language, data = df_dyn_L, method = "fREML")

# model summary
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ language
## 
## Parametric coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.04019    0.01355   2.965  0.00303 **
## languageJapanese -0.02048    0.01773  -1.155  0.24803   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  2.5e-05   Deviance explained = 0.00996%
## -REML =  19166  Scale est. = 1.0225    n = 13398

The model output displays a section called Parametric coefficients which shows how each of the paraemetric term has an overall effect on the F2 values. But overall, the output looks pretty much the same with the linear mixed-effect modelling using lme4::lmer().

Adding non-linearity

Our first model m1 models the constant effect on the language variable on the F2 values; that is, we could ask whether L1 English and L1 Japanese speakers are overall different in F2 frequencies. But also, as we saw earlier in data visualisation, there was a lot going on beyond the constant between-group difference.

Let’s extend the model so that we can model the non-linear between-group difference. Here, we add a new smooth term s(time) that allows us to model the non-linear difference between L1 English and L1 Japanese speakers in the F2 values over time.

# a model with a parametric and a smooth term
m2 <- mgcv::bam(f2z ~ language + s(time, by = language), data = df_dyn_L, method = "fREML")

# model summary
summary(m2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ language + s(time, by = language)
## 
## Parametric coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.03474    0.01243   2.796  0.00519 **
## languageJapanese -0.01566    0.01625  -0.963  0.33549   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df      F p-value    
## s(time):languageEnglish  6.700  7.834 284.86  <2e-16 ***
## s(time):languageJapanese 5.141  6.258  49.89  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.16   Deviance explained =   16%
## fREML =  18022  Scale est. = 0.8594    n = 13398

The top half of the model summary looks similar to the one for m1. A new addition here is Approximate significance of smooth terms; this is where we can evaluate the performance of the smooth terms. There are two columns that you may not be familar with, but edf would be the most important here.

  • edf stands for effective degree of freedom. This signifies how many parameters are needed to represent the smooth.

    • Edf indexes the degree of non-linearity of the smooth. An edf value closer to 1 means that the pattern modelled with s() is linear.

Visualising GAM

Unlike linear models, interpretation of non-linear patterns is quite tricky and complicated. So, data visualiastion is crucial in the modelling of non-linear relationships. Let’s visualise the GAM model that we just modelled.

The package itsadug lets you visualise the F2 smooths for each language group (L1 English and L1 Japanese speakers) as follows:

itsadug::plot_smooth(m2, view = "time", plot_all = "language", rug = FALSE)
## Summary:
##  * language : factor; set to the value(s): English, Japanese. 
##  * time : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
##  * NOTE : No random effects in the model to cancel.
## 

It is also possilbe to visualise the difference in F2 trajectories between the two speaker groups:

itsadug::plot_diff(m2, view = "time", com = list(language = c("English", "Japanese")))
## Summary:
##  * time : numeric predictor; with 100 values ranging from 0.000000 to 100.000000. 
##  * NOTE : No random effects in the model to cancel.
## 

## 
## time window(s) of significant difference(s):
##  0.000000 - 35.353535
##  41.414141 - 100.000000

We start to see some interesting bits here. The first plot suggests that, overall, L1 Japanese speakers (in light blue) show a flat F2 trajectory compared to L1 English speakers (in light pink). The difference between the two trajectories lie almost throughout the time course; the two trajectories cross over each other at approximately 40% time point, where we find little difference, but otherwise the confidence interval for the difference smooth is not overlapping zero consistently.

Model comparison

There are a couple of ways of statistical significance testing, but it is always recommended to do this via model comparison. The idea is similar to when we fit linear models, with only a few differences:

  • Instead of using anova(), we use itsadug::compareML() function.

  • Also, make sure that your GA(M)Ms model is run based on the Maximal Likelihood (ML) estimation method. This can be specified by the argument method = "ML". Whereas restricted ML (REML) and fast restricted ML (fREML) are more efficient computationally, these cannot be used for model comparison.

Here, let’s compare whether an addition of the smooth term results in a better model fit.

# re-fitting the first model with the Maximal Likelihood estimation
m1 <- mgcv::bam(f2z ~ language, data = df_dyn_L, method = "ML")

# re-fitting the first model with the Maximal Likelihood estimation
m2 <- mgcv::bam(f2z ~ language + s(time, by = language), data = df_dyn_L, method = "ML")

# model comparison
## full description
itsadug::compareML(m1, m2, print.output = TRUE, suggest.report = TRUE)
## m1: f2z ~ language
## 
## m2: f2z ~ language + s(time, by = language)
## 
## Report suggestion: The Chi-Square test on the ML scores indicates that model m2 is [significantly] better than model m1 (X2(4.00)=1146.117, p<2e-16).
## -----
##   Model    Score Edf Difference    Df  p.value Sig.
## 1    m1 19158.85   2                               
## 2    m2 18012.74   6   1146.117 4.000  < 2e-16  ***
## 
## AIC difference: 2315.23, model m2 has lower AIC.
## brief output
itsadug::compareML(m1, m2, print.output = FALSE)$table
##   Model    Score Edf Difference    Df  p.value Sig.
## 1    m1 19158.85   2                               
## 2    m2 18012.74   6   1146.117 4.000  < 2e-16  ***

The output again looks similar to the case of linear model comparisons using anova(). It says that m2 has a lower Akaike Infomration Criterion (AIC) scores than m1. It also shows that m2 has a lower (ML) score of 18012.74 compared to 19158.85 for m1. These two measures index the degree of model fit, and the lower, the better. Overall, the model comparison suggests that the model with a smooth m2 significantly improves the degree of model fit and is thus considered to be a better model than the model with parametric terms only m1.

Modelling difference smooths

As we are dealing with categorical predictors, it is possible to encode the non-linear difference into the model directly through difference smooths. Whereas we could stick to model two different smooths for each level in a predictor variable, we would ultimately know whether the two trajectories are statistically significantly different. The approach with difference smooth allows us to directly evaluate whether the between-group difference is significant or not.

To do this, we first need to convert the predictor variables into ordered variables, a special case of factor variables. Don’t forget to execute both as.ordered() and "contr.tretment" as they are important in GAMMs.

# converting language variable into ordered variable
df_dyn_L$language.ord <- as.ordered(df_dyn_L$language)
contrasts(df_dyn_L$language) <- "contr.treatment"

When trying to model with difference smooths, you need to specify the following:

  • a parametric term of the variable

  • a smooth term of time without any specification

  • a smooth over time with the grouping specificaion by = language.ord

So, the model would look like this:

# model
m3 <- mgcv::bam(f2z ~ language.ord + s(time) + s(time, by = language.ord), data = df_dyn_L, method = "fREML")

# summary
summary(m3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ language.ord + s(time) + s(time, by = language.ord)
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.026913   0.008127   3.311 0.000931 ***
## language.ord.L -0.011066   0.011494  -0.963 0.335656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                edf Ref.df     F p-value    
## s(time)                      7.121  8.055 276.6  <2e-16 ***
## s(time):language.ordJapanese 5.840  6.906 102.9  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.16   Deviance explained =   16%
## fREML =  18026  Scale est. = 0.85941   n = 13398

Let’s take a look at parametric coefficients first. It says that the intercept is statistically significant, meaning that it’s different from zero. language.ord.L does not show statistical significance, which would mean that there is no overall effect of language on the F2 values.

Let’s now move to the smooth terms. The edf values are quite bigger than 1 for both smooth terms, indicating that they capture non-linearity. s(time) is statistically significant, meaning that time is a strong predictor of f2z. In other words, time is not just linearly related to f2z, thus indicating the non-linearity. Also, the statistical significance of s(time):language.ordJapanese means that the relationship between time and f2z depends on the language.ord variable. That is, L1 English and L1 Japanese speakers show different non-linear trends in f2z over time at statistically significant level.

Comparison between m2 and m3

Let’s compare the output of m3 with that of m2 that we saw earlier. For m2, we fitted separate smooths for each level in language variable, and the model output only shows whether each smooth is significantly different from zero. I’ve reiterated the model specification here for convenience:

# a model with a parametric and a smooth term
m2 <- mgcv::bam(f2z ~ language + s(time, by = language), data = df_dyn_L, method = "fREML")

# model summary
summary(m2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ language + s(time, by = language)
## 
## Parametric coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.03474    0.01243   2.796  0.00519 **
## languageJapanese -0.01566    0.01625  -0.963  0.33549   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df      F p-value    
## s(time):languageEnglish  6.700  7.834 284.86  <2e-16 ***
## s(time):languageJapanese 5.141  6.258  49.89  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.16   Deviance explained =   16%
## fREML =  18022  Scale est. = 0.8594    n = 13398

Focussing on the smooth term table, it has two lines: s(time):languageEnglish and s(time):languageJapanese. Both show statistical significance, meaning that f2 and time interacts non-linearly for both L1 English and L1 Japanese speakers. But this does not necessarily tell us whether the two groups are different. This is why modelling with difference smooths can be handy when the primary interest lies in comparing between-group differences in trajectory pattern.

Adding context

So far, we have modelled the non-linear relationships between time and f2z only with the language variable. Let’s add the context variable here as well.

Your turn

Sticking to the difference smooths approach, could you follow the procedures below and fit a model with language and context variables?

Converting the context variable into ordered variables

# converting language variable into ordered variable
# df_dyn_L$context.ord <- ...
# don't forget the second line

Fitting the model

Please define a model with language.ord and context.ord, incorporated into both parametric and smooth terms, and save the model as m4.

# fitting a model with language.ord and context.ord
# m4 <- ...

# model summary
# ...

Model comparison

The practice here allows us to proceed onto model comparison between the full model with both language.ord and context.ord, and one without either of these. Could you come up with a good way to do this? Would you be able to say what effects you are testing through model comparison?

You can re-use m3 and m4. In addition, you might also want to specify m5. Don’t also forget about changing the estimation method.

# model with language.ord
m3 <- mgcv::bam(f2z ~ language.ord + s(time) + s(time, by = language.ord), data = df_dyn_L, method = "fREML")

# model with both language.ord and context.ord
m4 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord), data = df_dyn_L, method = "fREML")

# model with context.ord
# m5 <- ...

# comparison
# ...

Significance testing

We’re close to building the full model! Now, let’s focus on the statistical significance of the predictor variables, such as language.ord and context.ord.

In GAMMs, non-linear trajectories are captured through smooth terms, while parametric terms represent linear effects. The trajectories can be analyzed in terms of height (level) and shape (rate of change), which correspond to parametric and smooth terms, respectively.

Here’s how to assess the significance of each term:

1. Full Model vs. Nested Model (with Both Parametric and Smooth Terms)

To determine the overall significance of a variable’s effect (both on height and shape):

  • Compare the full model (which includes both the parametric and smooth terms for the variable of interest) with a nested model that excludes both the parametric and smooth terms associated with that variable.

Interpretation:

  • If the full model improves the fit significantly (using a likelihood ratio test or other model comparison methods), this suggests that the variable of interest affects both the trajectory height and the trajectory shape.

  • If the full model does not improve the fit significantly, this suggests that the variable of interest does not significantly affect the dependent variable (either in terms of height or shape).

2. Full Model vs. Nested Model (with Parametric Terms Only)

If the full model is found to be significantly better than the nested model in step 1, we can further investigate whether the variable affects both the height and shape of the trajectory:

  • Compare the full model (which includes both parametric and smooth terms for the variable) with another nested model that retains the parametric term but excludes the smooth term for the variable of interest.

Interpretation:

  • If the full model is still significantly better, this suggests that the variable has a significant effect on both the height (parametric term) and shape (smooth term) of the trajectory.

  • If the full model is not significantly better, this suggests that the variable’s effect is primarily on the height dimension (the parametric term) and that the shape (non-linear trajectory) is not significantly influenced by the variable.

Let’s see how this works for the language.ord effect:

Comparison 1: full model vs nested model without parametric/smooth terms

# full model
m4 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord), data = df_dyn_L, method = "ML")

# nested model 1 -- excluding both parametric and smooth terms
m4_1 <- mgcv::bam(f2z ~ context.ord + s(time) + s(time, by = context.ord), data = df_dyn_L, method = "ML")

# model comparison
compareML(m4, m4_1, suggest.report = FALSE)
## m4: f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + 
##     s(time, by = context.ord)
## 
## m4_1: f2z ~ context.ord + s(time) + s(time, by = context.ord)
## 
## Chi-square test of ML scores
## -----
##   Model    Score Edf Difference    Df  p.value Sig.
## 1  m4_1 11133.99   9                               
## 2    m4 10135.37  12    998.615 3.000  < 2e-16  ***
## 
## AIC difference: -2018.31, model m4 has lower AIC.

The model comparison suggests that the full model m4 significantly improves the model fit. Let’s now move onto another model comparison.

Comparison 2: full model vs nested model only with parametric term

# full model
m4 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord), data = df_dyn_L, method = "ML")

# nested model 2 -- including parametric term only
m4_2 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = context.ord), data = df_dyn_L, method = "ML")

# model comparison
compareML(m4, m4_2, suggest.report = FALSE)
## m4: f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + 
##     s(time, by = context.ord)
## 
## m4_2: f2z ~ language.ord + context.ord + s(time) + s(time, by = context.ord)
## 
## Chi-square test of ML scores
## -----
##   Model    Score Edf Difference    Df  p.value Sig.
## 1  m4_2 11132.36  10                               
## 2    m4 10135.37  12    996.989 2.000  < 2e-16  ***
## 
## AIC difference: -2017.05, model m4 has lower AIC.

The full model still improves the degree of model fit. This suggests that language.ord has a statistically significant difference on both height and shape dimensions.

Your turn

How about the context.ord?

Comparison 1: full model vs nested model without parametric/smooth terms

# full model
m4 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord), data = df_dyn_L, method = "ML")

# nested model 1 -- excluding both parametric and smooth terms
# m4_3 <- ...

# model comparison
# compareML(m4, m4_3, suggest.report = FALSE)

Comparison 2: full model vs nested model only with parametric term

Please write a code chunk if another model comparison is still necessary.

# full model
m4 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord), data = df_dyn_L, method = "ML")

# nested model 2 -- including parametric term only
# m4_4 <-

# model comparison
# compareML(m4, m4_4, suggest.report = FALSE)

Random effects

You might notice that we have modelled non-linear change of f2z over time with fixed effects only, and we have ignored by-speaker and by-item random effects. This is where the difference between GAM and GAMM comes in: Generalised Additive Mixed Effect Models, as the name suggests, can also incorporate random effects!

There are various ways to incorporate random effects, but here, we stick to one of them, which is based on factor smooths.

Factor smooths fit curves for indiviaul participants (if specified for speaker, participant etc) and for individual items (if specified for word, item etc). You could think of this as a non-linear equivalent of random intercepts and random slopes in the linear mixed-effect models.

Let’s start with accounting for the by-speaker factor smooths. In the model below, a new smooth term is added: s(time, speaker, bs = "fs", m = 1). This tells the model to fit non-linear trajectories for f2z over time for each of the speaker(s). bs = "fs" specifies factor smooths, and the final m = 1 is a control parameter of the degree of wiggliness of by-speaker trajectory – we leave it as it is.

# full model with by-speaker random effects
m5 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord) + s(time, speaker, bs = "fs", m = 1), data = df_dyn_L, method = "fREML")
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# model summary
summary(m5)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + 
##     s(time, by = context.ord) + s(time, speaker, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.481150   0.022390 -21.490   <2e-16 ***
## language.ord.L  0.021916   0.030942   0.708   0.4788    
## context.ord/i/  1.515216   0.009583 158.119   <2e-16 ***
## context.ord/u/  0.022558   0.010410   2.167   0.0302 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf  Ref.df       F p-value    
## s(time)                        7.378   7.926  43.520  <2e-16 ***
## s(time):language.ordJapanese   6.580   7.278  27.036  <2e-16 ***
## s(time):context.ord/i/         6.781   7.853 494.595  <2e-16 ***
## s(time):context.ord/u/         4.206   5.164  27.335  <2e-16 ***
## s(time,speaker)              235.149 403.000   6.374  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.783   Deviance explained = 78.8%
## fREML = 9235.1  Scale est. = 0.22161   n = 13398

Compared to the model outputs earlier, there is one extra line in the smooth term section. s(time,speaker) at the bottom indicates the smooth term to account for by-speaker variation. This shows statistical significance, meaning that this is good to be included.

And this is what by-speaker factor smooths looks like:

Your turn

In a similar manner to the by-speaker random effects, could you write a code of a model containing both by-speaker and by-item (operationalised here as word) random effects?

# full model with by-speaker and by-item random effects
# m6 <- ...

# model summary
# ...

Also, how do the by-item factor smooths look?

# drawing by-item factor smooths
# itsadug::inspect_random(m6, select = 6, lwd = 3, main = "random smooths by item")

Correcting autocorrelation

Looks like we’re pretty much all set – well done on getting this far! However, there is one more important aspect to consider, and we’ll need to look into the assumptions under which GAMMs operate.

If you remember from the linear models, I explained that statistcal tests have assumptions that need to be met. And it concerned with residuals. Residuals mean the difference between observed and fitted values: in linear models, casually speaking, we should not expect any visible structure in the residual distributions under the normality and constant assumption.

Something similar can be said here: residuals need to be independent from each other for GAMMs models, too. In a simpler term, the model does not know (yet) whether data points are completely independent of each other or sampled from a series of time series data. This has an important consequence in estimation accuracy, so we need to address this.

The solution is to incorporate AR(1) model into your GAMMs model. Through adding AR(1) model, we define:

  1. an extra parameter rho (\(\rho\))

  2. an extra column in the data set telling where a trajectory starts (start.event)

Let’s go through the process to build and incorporate the AR(1) model.

Check autocorrelation

Let’s check the residual structure in our full model m6. This can be done via itsadug::acf_resid() function.

# showing residual autocorrelations
itsadug::acf_resid(m6)

The plot shows residual autocorrelation at different lags. For example, the autocorrelation at lag 0 (i.e., “0” on the x-axis) shows the value of 1, and this makes sense because lag 0 shows the correlation of the residuals with themselves. The residual correlation at Lag 1 is calculated by taking the f2z values at the time 0, 1, 2, … and correlating them with the f2z values at time 1, 2, 3 … . Similarly, lag 2 is the correlation between correlations of the f2z values at the time 0, 1, 2 … and at the time 2, 3, 4 … .

This is somewhat complicated, but the important thing to look for in the plot is the fact that the heights of the two vertical lines next to each other are correlated. This suggests that residuals are correlated with each other and this is not ideal for accurate modelling estimation.

Defining rho (\(\rho\))

In order to address the residual autocorrelation, we will first define a rho value. It is advisable that you explore a wide range of rho values, but a rule of thumb is to define the rho value as the autocorrelation at lag 1.

# obtain the autocorrelation at lag 1 as a rho value
rho_m6 <- start_value_rho(m6)

The autocorrelation at lag 1 is approximately 0.783, and we use this as a rho value in the model.

Defining the start of an event

Another thing we need to do is to define the beginning of each trajectory. This simply marks the row in the data corresponding to time = 0 as TRUE.

Note: When doing this, make sure that your data is ordered appropriately. In our case here, the time sequence needs to be arranged from 0, 10, 20, … 100 for each token.

# let's arrange the data first
df_dyn_L <- df_dyn_L |> 
  dplyr::arrange(
    speaker, file, time
  ) |> 
  dplyr::relocate(time) # to move the time column forward

# define the event onset
df_dyn_L$start.event <- df_dyn_L$time == 0

# check data
df_dyn_L |> 
  dplyr::select(speaker, file, time, f2z, start.event)
## # A tibble: 13,398 × 5
##    speaker file                    time    f2z start.event
##    <fct>   <chr>                  <dbl>  <dbl> <lgl>      
##  1 2d57ke  JP_2d57ke_lamb033_0001     0 -0.701 TRUE       
##  2 2d57ke  JP_2d57ke_lamb033_0001    10 -0.679 FALSE      
##  3 2d57ke  JP_2d57ke_lamb033_0001    20 -0.630 FALSE      
##  4 2d57ke  JP_2d57ke_lamb033_0001    30 -0.679 FALSE      
##  5 2d57ke  JP_2d57ke_lamb033_0001    40 -0.668 FALSE      
##  6 2d57ke  JP_2d57ke_lamb033_0001    50 -0.648 FALSE      
##  7 2d57ke  JP_2d57ke_lamb033_0001    60 -0.632 FALSE      
##  8 2d57ke  JP_2d57ke_lamb033_0001    70 -0.624 FALSE      
##  9 2d57ke  JP_2d57ke_lamb033_0001    80 -0.668 FALSE      
## 10 2d57ke  JP_2d57ke_lamb033_0001    90 -0.948 FALSE      
## # ℹ 13,388 more rows

As you can see above, the data are arranged according to time within each token, and the onset of a trajectory (i.e., where time = 0) is marked TRUE in the start.event column.

Fitting the full model with AR(1) model

Finally, let’s incorporate the residual autocorrelation correction (i.e., AR(1) model) into our model.

# full model with AR(1) model
m7 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "fREML", rho = rho_m6, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# model summary
summary(m7)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + 
##     s(time, by = context.ord) + s(time, speaker, bs = "fs", m = 1) + 
##     s(time, word, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.48064    0.03749 -12.819   <2e-16 ***
## language.ord.L  0.02145    0.02866   0.749    0.454    
## context.ord/i/  1.51032    0.04647  32.499   <2e-16 ***
## context.ord/u/  0.02278    0.05174   0.440    0.660    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf  Ref.df      F p-value    
## s(time)                        7.241   7.396 19.887  <2e-16 ***
## s(time):language.ordJapanese   7.566   7.842 28.637  <2e-16 ***
## s(time):context.ord/i/         6.592   6.792 16.342  <2e-16 ***
## s(time):context.ord/u/         1.000   1.001  0.542   0.462    
## s(time,speaker)              312.889 403.000  5.498  <2e-16 ***
## s(time,word)                  52.113  69.000 15.366  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.796   Deviance explained = 80.2%
## fREML = 1118.8  Scale est. = 0.15231   n = 13398

Let’s also check whether autocorrelation is accounted for.

# showing residual autocorrelations
itsadug::acf_resid(m7)

The autocorrelation plot shows that the autocorrelation at lag 1 (i.e., second vertical line from the left) is substantially shorter than in the previous plot. This means that the new model m7 successfully recognises the dependencies between data points and thus the residual autocorrelations have been eliminated.

Model criticism

We have touched upon some assumptions just now, and it’s a good practice to check whether our model satisfies those assumptions.

# model criticism
mgcv::gam.check(m7)

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 17 iterations.
## Gradient range [-3.156672e-05,4.419978e-05]
## (score 1118.832 & scale 0.1523108).
## Hessian positive definite, eigenvalue range [3.155692e-05,6698.08].
## Model rank =  517 / 517 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                  k'    edf k-index p-value
## s(time)                        9.00   7.24       1    0.59
## s(time):language.ordJapanese   9.00   7.57       1    0.59
## s(time):context.ord/i/         9.00   6.59       1    0.57
## s(time):context.ord/u/         9.00   1.00       1    0.55
## s(time,speaker)              405.00 312.89       1    0.52
## s(time,word)                  72.00  52.11       1    0.55

The plots here look overall OK, but not ideal. The tails on both ends of the ‘spaghetti-like’ line in the first plot suggests that the residual distribution does not strictly follow the normal distribution: Rather, it suggests that it follows a t-distribution. The histogram also shows somewhat skewed distribution.

We won’t try correcting these issues in the interest of time, but existing GAMMs tutorials suggest some ways to address this. One way, for example, is to tell your GAMMs model to follow a scaled t distribution by adding family = 'scat' argument. But for now, let’s leave it as it is.

Final model: data visualisation

We have covered quite a few things so far and we have not managed to visualise the full model. Let’s do this finally.

# full model with AR(1) model
m7 <- mgcv::bam(f2z ~ language.ord + context.ord + s(time) + s(time, by = language.ord) + s(time, by = context.ord) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "fREML", rho = rho_m6, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# visualisation
## /æ/ context
### separate smooths
plot_smooth(m7, view = "time", plot_all = "language.ord", cond = list(context.ord = "/æ/"))
## Summary:
##  * language.ord : factor; set to the value(s): English, Japanese. 
##  * context.ord : factor; set to the value(s): /æ/. 
##  * time : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
##  * speaker : factor; set to the value(s): ds6umh. (Might be canceled as random effect, check below.) 
##  * word : factor; set to the value(s): lamb. (Might be canceled as random effect, check below.) 
##  * NOTE : The following random effects columns are canceled: s(time,speaker),s(time,word)
## 

### difference smooth
plot_diff(m7, view = "time", comp = list(language.ord = c("Japanese", "English")), cond = list(context.ord = "/æ/"), print.summary = FALSE, main = "Japanese-English difference in the /æ/ context")

### Your turn

Could you write a code chunk to visualise (1) separate and (2) difference smooths between L1 English and L1 Japanese spaekers in the /i/ and /u/ contexts?

# visualisation
## /i/ context
### separate smooths
# ...

### difference smooth
# ...

## /u/ context
### separate smooths
# ...

### difference smooth
# ...

Interaction

Finally, let’s briefly think about how to model interactions between language and context. Modelling interactions in GAMMs models is highly complex, and I did not try this in my publication. I still do not know how to perform significance testing via model comparison, we could at least try modelling and inspecting the model summary.

This section can also showcase as a complete analysis workflow based on things that we’ve covered so far.

Creating the LangCont variable

One way of modelling the categorical \(\times\) categorical interaction in GAMMs is to concatenate two variables into one. This can be done via interaction function.

# convert variables into factor
df_dyn_L$language <- as.factor(df_dyn_L$language)
df_dyn_L$context <- as.factor(df_dyn_L$context)
df_dyn_L$speaker <- as.factor(df_dyn_L$speaker)
df_dyn_L$word <- as.factor(df_dyn_L$word)

# create an interaction variable
df_dyn_L$LangCont <- interaction(df_dyn_L$language, df_dyn_L$context)

To quickly reiterate, language has two levels (Japanese vs English) and context has three levels (/æ/ vs /i/ vs /u/). By combining them together, LangCont can fit a trajectory for each of the six levels (i.e., two for language \(\times\) three for context):

# checking the levels in the LangCont column
df_dyn_L |> 
  dplyr::group_by(LangCont) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 6 × 1
##   LangCont    
##   <fct>       
## 1 English./æ/ 
## 2 Japanese./æ/
## 3 English./i/ 
## 4 Japanese./i/
## 5 English./u/ 
## 6 Japanese./u/

This way, we can model the interaction while treating LangCont as just one variable in the model.

Fitting the first model

We’ll stick to the random smooths approach, so let’s convert `LangCont into an ordered variable.

# converting LangCont into ordered variable
df_dyn_L$LangCont.ord <- as.ordered(df_dyn_L$LangCont)
contrasts(df_dyn_L$LangCont.ord) <- "contr.treatment" # don't forget this!

# fitting a model
m10 <- mgcv::bam(f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "fREML")
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# summary
summary(m10)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, 
##     speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.29612    0.05487  -5.397 6.89e-08 ***
## LangCont.ordJapanese./æ/ -0.29689    0.05076  -5.849 5.07e-09 ***
## LangCont.ordEnglish./i/   1.11191    0.05163  21.534  < 2e-16 ***
## LangCont.ordJapanese./i/  1.51342    0.07141  21.194  < 2e-16 ***
## LangCont.ordEnglish./u/  -0.21483    0.05763  -3.728 0.000194 ***
## LangCont.ordJapanese./u/ -0.10310    0.07578  -1.361 0.173657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                      edf  Ref.df      F  p-value    
## s(time)                            7.139   7.477 23.657  < 2e-16 ***
## s(time):LangCont.ordJapanese./æ/   6.604   7.466 47.542  < 2e-16 ***
## s(time):LangCont.ordEnglish./i/    4.420   5.018  8.549  < 2e-16 ***
## s(time):LangCont.ordJapanese./i/   5.531   6.249  3.588 0.001264 ** 
## s(time):LangCont.ordEnglish./u/    1.000   1.000 11.538 0.000684 ***
## s(time):LangCont.ordJapanese./u/   5.729   6.669 21.049  < 2e-16 ***
## s(time,speaker)                  251.152 403.000  8.546  < 2e-16 ***
## s(time,word)                      48.151  69.000 17.876  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.831   Deviance explained = 83.5%
## fREML = 7680.8  Scale est. = 0.17263   n = 13398

Residual autocorrelation corrections

We also need to build AR(1) model to account for the residual autocorrelations.

# mark start.event
## arrange the data appropriately
df_dyn_L <- df_dyn_L |> 
  dplyr::arrange(file, time)

## mark time = 0
df_dyn_L$start.event <- df_dyn_L$time == 0

# check residual autocorrelations
itsadug::acf_resid(m10)

# obtain the autocorrelation at lag 1 as a rho value
rho_m10 <- start_value_rho(m10)

Fitting model again

# fitting a model
m11 <- mgcv::bam(f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "fREML", rho = rho_m10, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# summary
summary(m11)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, 
##     speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1)
## 
## Parametric coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.29532    0.05184  -5.697 1.24e-08 ***
## LangCont.ordJapanese./æ/ -0.29656    0.04928  -6.018 1.81e-09 ***
## LangCont.ordEnglish./i/   1.11110    0.05275  21.062  < 2e-16 ***
## LangCont.ordJapanese./i/  1.50725    0.06817  22.111  < 2e-16 ***
## LangCont.ordEnglish./u/  -0.21568    0.05860  -3.681 0.000234 ***
## LangCont.ordJapanese./u/ -0.10876    0.07250  -1.500 0.133603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                      edf  Ref.df      F  p-value    
## s(time)                            7.573   7.684 21.480  < 2e-16 ***
## s(time):LangCont.ordJapanese./æ/   7.770   8.207 46.032  < 2e-16 ***
## s(time):LangCont.ordEnglish./i/    5.015   5.458  7.728 1.41e-06 ***
## s(time):LangCont.ordJapanese./i/   7.151   7.530  4.539 3.62e-05 ***
## s(time):LangCont.ordEnglish./u/    4.646   5.108  3.095   0.0079 ** 
## s(time):LangCont.ordJapanese./u/   6.802   7.287  8.889  < 2e-16 ***
## s(time,speaker)                  317.927 403.000  6.102  < 2e-16 ***
## s(time,word)                      51.634  69.000 16.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.83   Deviance explained = 83.5%
## fREML = 770.17  Scale est. = 0.13281   n = 13398
# check residual autocorrelations
itsadug::acf_resid(m11)

Looks like all smooth and parametric terms are quite important! Let’s quickly do the model comparison for significant testing.

# full model with ML
m11 <- mgcv::bam(f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "ML", rho = rho_m10, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# nested model with ML
m11_2 <- mgcv::bam(f2z ~ s(time) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "ML", rho = rho_m10, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# model comparison
itsadug::compareML(m11, m11_2, suggest.report = TRUE)
## m11: f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, 
##     speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1)
## 
## m11_2: f2z ~ s(time) + s(time, speaker, bs = "fs", m = 1) + s(time, 
##     word, bs = "fs", m = 1)
## 
## Report suggestion: The Chi-Square test on the ML scores indicates that model m11 is [significantly?] better than model m11_2 (X2(15.00)=578.837, p<2e-16).
## -----
##   Model     Score Edf Difference     Df  p.value Sig.
## 1 m11_2 1326.5758   7                                
## 2   m11  747.7391  22    578.837 15.000  < 2e-16  ***
## 
## AIC difference: -989.36, model m11 has lower AIC.

The full model significantly improves the model fit. Let’s further investigate whether this is to do with shape difference.

# full model with ML
m11 <- mgcv::bam(f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "ML", rho = rho_m10, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# nested model with ML 2 -- parametric term only 
m11_3 <- mgcv::bam(f2z ~ LangCont.ord + s(time) + s(time, speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1), data = df_dyn_L, method = "ML", rho = rho_m10, AR.start = df_dyn_L$start.event)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
# model comparison
itsadug::compareML(m11, m11_3, suggest.report = TRUE)
## m11: f2z ~ LangCont.ord + s(time) + s(time, by = LangCont.ord) + s(time, 
##     speaker, bs = "fs", m = 1) + s(time, word, bs = "fs", m = 1)
## 
## m11_3: f2z ~ LangCont.ord + s(time) + s(time, speaker, bs = "fs", m = 1) + 
##     s(time, word, bs = "fs", m = 1)
## 
## Report suggestion: The Chi-Square test on the ML scores indicates that model m11 is [significantly?] better than model m11_3 (X2(10.00)=407.254, p<2e-16).
## -----
##   Model     Score Edf Difference     Df  p.value Sig.
## 1 m11_3 1154.9934  12                                
## 2   m11  747.7391  22    407.254 10.000  < 2e-16  ***
## 
## AIC difference: -677.36, model m11 has lower AIC.

The full model still improves the degree of model fit at the statistically significant level. This overall suggests that the interaction between language and context is statistically significant on trajectory shape.

Data visualisation

Instead of the plotting fucntions in itsadug, here I use the tidygam package to visualise data while disentangling the language and context effects.

# inspect gams
m11_preds <- tidygam::predict_gam(m11, exclude_terms = c("s(time,speaker)", "s(time,word)"))
## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `fit = rowSums(dplyr::across())`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.
# show prediction
m11_preds
## # A tibble: 66 × 6
##    LangCont.ord  time     f2z     se lower_ci upper_ci
##    <ord>        <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
##  1 English./æ/      0 -1.14   0.105    -1.34   -0.931 
##  2 English./i/      0 -0.799  0.280    -1.35   -0.251 
##  3 English./u/      0 -0.915  0.299    -1.50   -0.328 
##  4 Japanese./æ/     0 -0.415  0.220    -0.847   0.0158
##  5 Japanese./i/     0 -0.0140 0.309    -0.621   0.593 
##  6 Japanese./u/     0 -0.383  0.326    -1.02    0.255 
##  7 English./æ/     10 -1.03   0.0845   -1.20   -0.868 
##  8 English./i/     10 -0.548  0.232    -1.00   -0.0930
##  9 English./u/     10 -0.859  0.248    -1.34   -0.373 
## 10 Japanese./æ/    10 -0.407  0.189    -0.777  -0.0376
## # ℹ 56 more rows
# separate two main effects
m11_preds_2 <- tidygam::predict_gam(
  m11,
  length_out = 25,
  exclude_terms = c("s(time,speaker)", "s(time,word)"),
  separate = list(LangCont.ord = c("language", "context"))
)

# get gamm prediction
m11_preds_2 |>
  plot(series = "time", comparison = "language") +
  scale_colour_manual(values = cbPalette) +
  scale_fill_manual(values = cbPalette) +
  facet_grid(cols = vars(context))

Summary

We’ve covered so many things! Here is a brief summary of what has been covered today:

Static data analysis does not always highlight between-group differences accurately.

As our comparison of static and dynamic data visualisation suggests, it is important to think how the data points are sampled for static analysis. It might be the case that two groups happen to be similar when two time-varying contours just cross over.

Generalised Additive Mixed Modelling can model non-linear between-group difference over time

By combining parametric and smooth terms, GAMMs can capture non-linear between-group difference in a response variable over time. We also learnt a modelling approach using difference smooths that can encode the between-group difference directly into the model.

Always visualise the data!

While we have tried interpretting the model summary a few times, it is still quite complex. This is why data visualisation is really important in non-linear modelling like GAMMs.

Wrap-up question

How did you find GAMMs so far? What types of data/research questions might be suitable to be tested through GAMMs?

References

This workshop material is heavily drawn from the great existing tutorials and empirical papers. All of these provide great details in GAMMs modelling and I would highly recommend all of these if you would like to know more about GAMMs.

Chuang, Y.-Y., Fon, J., Papakyritsis, I., Baayen, H., & Ball, M. J. (2021). Analyzing Phonetic Data with Generalized Additive Mixed Models. In M. Ball (Ed.), Manual of Clinical Phonetics (1st ed., Vol. 1, pp. 108–138). Routledge. https://doi.org/10.4324/9780429320903-10

Coretta, S. (2024). Learn Generalised Additive (Mixed) Models. Online tutorial. https://stefanocoretta.github.io/learnGAM/

Kirkham, S, Nance, C., Littlewood, B., Lightfoot, K., & Groarke, E. (2019). Dialect variation in formant dynamics: The acoustics of lateral and vowel sequences in Manchester and Liverpool English. The Journal of Acoustical Society of America, 145(2). 784–794. https://doi.org/10.1121/1.5089886

Sóskuthy, M. (2017). Generalised Additive Mixed Models for dynamic analysis in linguistics: a practical introduction. arXiv:1703.05339 [stat:AP].

Sóskuthy, M., Foulkes, P., Hughes, V., & Haddican, B. (2018). Changing words and sounds: the roles of different cognitive units in sound change. Topics in Cognitive Science, 10, 787–802.

Sóskuthy, M. (2021). Evaluating generalised additive mixed modelling strategies for dynamic speech analysis. Journal of Phonetics, 84, 101017. https://doi.org/10.1016/j.wocn.2020.101017

Steffman, J. (2022). Modeling intonational tunes with Generalized Additive Mixed Models: A practical introduction. Online tutorial. https://jsteffman.github.io/GAMM_intonation_modeling.html

Wieling, M. (2018). Analyzing dynamic phonetic data using generalized additive mixed modeling: A tutorial focusing on articulatory differences between L1 and L2 speakers of English. Journal of Phonetics, 70, 86–116. https://doi.org/10.1016/j.wocn.2018.03.002